Efficient batchwise dropout training using submatrices

Authors

  • Benjamin Graham
  • Jeremy Reizenstein
  • Leigh Robinson
Abstract

Dropout is a popular technique for regularizing artificial neural networks. Dropout networks are generally trained by minibatch gradient descent with a dropout mask turning off some of the units; a different pattern of dropout is applied to every sample in the minibatch. We explore a very simple alternative to the dropout mask. Instead of masking dropped-out units by setting them to zero, we perform matrix multiplication using a submatrix of the weight matrix, so that unneeded hidden units are never calculated. Performing dropout batchwise, so that the same pattern of dropout is used for every sample in a minibatch, we can substantially reduce training times. Batchwise dropout can be used with fully-connected and convolutional neural networks.

1 Independent versus batchwise dropout

Dropout is a technique to regularize artificial neural networks; it prevents overfitting [8]. A fully connected network with two hidden layers of 80 units each can learn to classify the MNIST training set perfectly in about 20 training epochs, but unfortunately the test error is quite high, about 2%. Increasing the number of hidden units by a factor of 10 and using dropout results in a lower test error, about 1.1%. The dropout network takes longer to train in two senses: each training epoch takes several times longer, and the number of training epochs needed increases too. We consider a technique for speeding up training with dropout that can substantially reduce the time needed per epoch.

Consider a very simple ℓ-layer fully connected neural network with dropout. To train it with a minibatch of b samples, the forward pass is described by the equations

x_{k+1} = [x_k · d_k] × W_k,    k = 0, ..., ℓ − 1.

Here x_k is a b × n_k matrix of input/hidden/output units, d_k is a b × n_k dropout-mask matrix of independent Bernoulli(1 − p_k) random variables, p_k denotes the probability of dropping out units in level k, and W_k is an n_k × n_{k+1} matrix of weights connecting level k with level k + 1. We use · for (Hadamard) element-wise multiplication and × for matrix multiplication. We have omitted the non-linear functions (e.g. the rectifier function for the hidden units and softmax for the output units), but for the introduction we will keep the network as simple as possible.

The network can be trained using the backpropagation algorithm to calculate the gradients of a cost function (e.g. negative log-likelihood) with respect to the W_k:

∂cost/∂W_k = [x_k · d_k]^T × ∂cost/∂x_{k+1},
∂cost/∂x_k = (∂cost/∂x_{k+1} × W_k^T) · d_k.

With dropout training, we are trying to minimize the cost function averaged over an ensemble of closely related networks. However, networks typically contain thousands of hidden units, so the size of the ensemble is much larger than the number of training samples that can possibly be 'seen' during training. This suggests that the independence of the rows of the dropout-mask matrices d_k might not be terribly important; the success of dropout simply cannot depend on exploring a large fraction of the available dropout masks. Some machine learning libraries, such as Pylearn2, allow dropout to be applied batchwise instead of independently. This is done by replacing d_k with a 1 × n_k row matrix of independent Bernoulli(1 − p_k) random variables and then copying it vertically b times to get the right shape. To be practical, it is important that each training minibatch can be processed quickly.
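To make the contrast concrete, the following is a minimal NumPy sketch (not the authors' code) of the two forward passes described above: independent per-sample masks versus a single batchwise pattern applied by taking a submatrix of each weight matrix. The layer sizes, the dropout probabilities, and the omission of non-linearities, biases and the usual rescaling of retained units are all simplifying assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: minibatch b and layer widths n_0, n_1, n_2 (assumptions, not the paper's).
b, n0, n1, n2 = 32, 784, 800, 10
p = [0.0, 0.5]                              # dropout probabilities p_k for levels 0 and 1
W = [rng.standard_normal((n0, n1)) * 0.01,  # W_k connects level k with level k + 1
     rng.standard_normal((n1, n2)) * 0.01]
x0 = rng.standard_normal((b, n0))

# Independent dropout: a different b x n_k Bernoulli(1 - p_k) mask row for every sample.
x = x0
for k in range(2):
    d_k = (rng.random((b, x.shape[1])) > p[k]).astype(x.dtype)
    x = (x * d_k) @ W[k]                    # x_{k+1} = [x_k . d_k] x W_k
independent_out = x

# Batchwise dropout via submatrices: one dropout pattern for the whole minibatch,
# implemented by indexing out the kept rows/columns of W_k; dropped units are never computed.
keep = [np.flatnonzero(rng.random(n) > pk) for n, pk in zip([n0, n1], p)]
keep.append(np.arange(n2))                  # output units are never dropped
x = x0[:, keep[0]]
for k in range(2):
    W_sub = W[k][np.ix_(keep[k], keep[k + 1])]   # submatrix of the weight matrix
    x = x @ W_sub                                # only retained units take part in the product
batchwise_out = x

# Gradients would follow the backpropagation equations above restricted to the kept units,
# e.g. a hypothetical dW_sub = x_prev_sub.T @ dcost_dx_next_sub for each layer's submatrix.
```

The batchwise version multiplies smaller matrices throughout, which is where the reduction in training time described in the abstract comes from.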
A crude way of estimating the processing time is to count the number of floating-point multiplication operations needed (naively) to evaluate the × matrix multiplications specified above.
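As a rough, hypothetical illustration of such a count (the paper's own comparison is not reproduced in this excerpt): a dense product of a b × n_k matrix with an n_k × n_{k+1} matrix needs b · n_k · n_{k+1} multiplications, while the batchwise submatrix product only involves the units that survive dropout, roughly a (1 − p_k)(1 − p_{k+1}) fraction of that.

```python
def dense_mults(b, n_in, n_out):
    # multiplications for a full b x n_in times n_in x n_out product
    return b * n_in * n_out

def submatrix_mults(b, n_in, n_out, p_in, p_out):
    # expected multiplications when only the retained units are computed
    return b * round(n_in * (1 - p_in)) * round(n_out * (1 - p_out))

# hypothetical sizes, not taken from the paper
print(dense_mults(100, 800, 800))                 # 64000000
print(submatrix_mults(100, 800, 800, 0.5, 0.5))   # 16000000
```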

Similar articles

QBDC: Query by dropout committee for training deep supervised architecture

While the current trend is to increase the depth of neural networks to increase their performance, the size of their training database has to grow accordingly. Tremendous databases are emerging, but providing labels to build a training set still remains a very expensive task. We tackle the problem of selecting the samples to be labelled in an online fashion. In this paper, we ...

Dropout distillation

Dropout is a popular stochastic regularization technique for deep neural networks that works by randomly dropping (i.e. zeroing) units from the network during training. This randomization process makes it possible to implicitly train an ensemble of exponentially many networks sharing the same parametrization, which should be averaged at test time to deliver the final prediction. A typical workaround for t...

Alpha-Divergences in Variational Dropout

We investigate the use of alternative divergences to Kullback-Leibler (KL) in variational inference (VI), based on Variational Dropout [10]. Stochastic gradient variational Bayes (SGVB) [9] is a general framework for estimating the evidence lower bound (ELBO) in Variational Bayes. In this work, we extend the SGVB estimator using Alpha-Divergences, which are alternative divergences to...

To Drop or Not to Drop: Robustness, Consistency and Differential Privacy Properties of Dropout

Training deep belief networks (DBNs) requires optimizing a non-convex function with an extremely large number of parameters. Naturally, existing gradient descent (GD) based methods are prone to arbitrarily poor local minima. In this paper, we rigorously show that such local minima can be avoided (up to an approximation error) by using the dropout technique, a widely used heuristic in this domain...

Manifold Regularized Deep Neural Networks using Adversarial Examples

Learning meaningful representations using deep neural networks involves designing efficient training schemes and well-structured networks. Currently, stochastic gradient descent with momentum and dropout is one of the most popular training protocols. Based on that, more advanced methods (i.e., Maxout and Batch Normalization) have been proposed in recent years, but most stil...

Journal:
  • CoRR

Volume: abs/1502.02478

Pages: -

Publication year: 2015